10 research outputs found

    Adaptive Prefetching and Cache Partitioning for Multicore Processors

    Get PDF
    El acceso a la memoria principal en los procesadores actuales supone un importante cuello de botella para las prestaciones, dado que los diferentes núcleos compiten por el limitado ancho de banda de memoria, agravando la brecha entre las prestaciones del procesador y las de la memoria principal. Distintas técnicas atacan este problema, siendo las más relevantes el uso de jerarquías de caché multinivel y la prebúsqueda. Las cachés jerárquicas aprovechan la localidad temporal y espacial que en general presentan los programas en el acceso a los datos, para mitigar las enormes latencias de acceso a memoria principal. Para limitar el número de accesos a la memoria DRAM, fuera del chip, los procesadores actuales cuentan con grandes cachés de último nivel (LLC). Para mejorar su utilización y reducir costes, estas cachés suelen compartirse entre todos los núcleos del procesador. Este enfoque mejora significativamente el rendimiento de la mayoría de las aplicaciones en comparación con el uso de cachés privados más pequeños. Compartir la caché, sin embargo, presenta una problema importante: la interferencia entre aplicaciones. La prebúsqueda, por otro lado, trae bloques de datos a las cachés antes de que el procesador los solicite, ocultando la latencia de memoria principal. Desafortunadamente, dado que la prebúsqueda es una técnica especulativa, si no tiene éxito puede contaminar la caché con bloques que no se usarán. Además, las prebúsquedas interfieren con los accesos a memoria normales, tanto los del núcleo que emite las prebúsquedas como los de los demás. Esta tesis se centra en reducir la interferencia entre aplicaciones, tanto en las caché compartidas como en el acceso a la memoria principal. Para reducir la interferencia entre aplicaciones en el acceso a la memoria principal, el mecanismo propuesto en esta disertación regula la agresividad de cada prebuscador, activando o desactivando selectivamente algunos de ellos, dependiendo de su rendimiento individual y de los requisitos de ancho de banda de memoria principal de los otros núcleos. Con respecto a la interferencia en cachés compartidos, esta tesis propone dos técnicas de particionado para la LLC, las cuales otorgan más espacio de caché a las aplicaciones que progresan más lentamente debido a la interferencia entre aplicaciones. La primera propuesta de particionado de caché requiere hardware específico no disponible en procesadores comerciales, por lo que se ha evaluado utilizando un entorno de simulación. La segunda propuesta de particionado de caché presenta una familia de políticas que superan las limitaciones en el número de particiones y en el número de vías de caché disponibles mediante la agrupación de aplicaciones en clústeres y la superposición de particiones de caché, por lo que varias aplicaciones comparten las mismas vías. Dado que se ha implementado utilizando los mecanismos para el particionado de la LLC que presentan algunos procesadores Intel modernos, esta propuesta ha sido evaluada en una máquina real. Los resultados experimentales muestran que el mecanismo de prebúsqueda selectiva propuesto en esta tesis reduce el número de solicitudes de memoria principal en un 20%, cosa que se traduce en mejoras en la equidad del sistema, el rendimiento y el consumo de energía. Por otro lado, con respecto a los esquemas de partición propuestos, en comparación con un sistema sin particiones, ambas propuestas reducen la iniquidad del sistema en un promedio de más del 25%, independientemente de la cantidad de aplicaciones en ejecución, y esta reducción en la injusticia no afecta negativamente al rendimiento.Accessing main memory represents a major performance bottleneck in current processors, since the different cores compete among them for the limited offchip bandwidth, aggravating even more the so called memory wall. Several techniques have been applied to deal with the core-memory performance gap, with the most preeminent ones being prefetching and hierarchical caching. Hierarchical caches leverage the temporal and spacial locality of the accessed data, mitigating the huge main memory access latencies. To limit the number of accesses to the off-chip DRAM memory, current processors feature large Last Level Caches. These caches are shared between all the cores to improve the utilization of the cache space and reduce cost. This approach significantly improves the performance of most applications compared to using smaller private caches. Cache sharing, however, presents an important shortcoming: the interference between applications. Prefetching, on the other hand, brings data blocks to the caches before they are requested, hiding the main memory latency. Unfortunately, since prefetching is a speculative technique, inaccurate prefetches may pollute the cache with blocks that will not be used. In addition, the prefetches interfere with the regular memory requests, both the ones from the application running on the core that issued the prefetches and the others. This thesis focuses on reducing the inter-application interference, both in the shared cache and in the access to the main memory. To reduce the interapplication interference in the access to main memory, the proposed approach regulates the aggressiveness of each core prefetcher, and selectively activates or deactivates some of them, depending on their individual performance and the main memory bandwidth requirements of the other cores. With respect to interference in shared caches, this thesis proposes two LLC partitioning techniques that give more cache space to the applications that have their progress diminished due inter-application interferences. The first cache partitioning proposal requires dedicated hardware not available in commercial processors, so it has been evaluated using a simulation framework. The second proposal dealing with cache partitioning presents a family of partitioning policies that overcome the limitations in the number of partitions and the number of available ways by grouping applications and overlapping cache partitions, so multiple applications share the same ways. Since it has been implemented using the cache partitioning features of modern Intel processors it has been evaluated in a real machine. Experimental results show that the proposed selective prefetching mechanism reduces the number of main memory requests by 20%, which translates to improvements in unfairness, performance, and energy consumption. On the other hand, regarding the proposed partitioning schemes, compared to a system with no partitioning, both reduce unfairness more than 25% on average, regardless of the number of applications running in the multicore, and this reduction in unfairness does not negatively affect the performance.L'accés a la memòria principal en els processadors actuals suposa un important coll d'ampolla per a les prestacions, ja que els diferents nuclis competeixen pel limitat ample de banda de memòria, agreujant la bretxa entre les prestacions del processador i les de la memòria principal. Diferents tècniques ataquen aquest problema, sent les més rellevants l'ús de jerarquies de memòria cau multinivell i la prebusca. Les memòries cau jeràrquiques aprofiten la localitat temporal i espacial que en general presenten els programes en l'accés a les dades per mitigar les enormes latències d'accés a memòria principal. Per limitar el nombre d'accessos a la memòria DRAM, fora del xip, els processadors actuals compten amb grans caus d'últim nivell (LLC). Per millorar la seva utilització i reduir costos, aquestes memòries cau solen compartir-se entre tots els nuclis del processador. Aquest enfocament millora significativament el rendiment de la majoria de les aplicacions en comparació amb l'ús de caus privades més menudes. Compartir la memòria cau, no obstant, presenta una problema important: la interferencia entre aplicacions. La prebusca, per altra banda, porta blocs de dades a les memòries cau abans que el processador els sol·licite, ocultant la latència de memòria principal. Desafortunadament, donat que la prebusca és una técnica especulativa, si no té èxit pot contaminar la memòria cau amb blocs que no fan falta. A més, les prebusques interfereixen amb els accessos normals a memòria, tant els del nucli que emet les prebusques com els dels altres. Aquesta tesi es centra en reduir la interferència entre aplicacions, tant en les cau compartides com en l'accés a la memòria principal. Per reduir la interferència entre aplicacions en l'accés a la memòria principal, el mecanismo proposat en aquesta dissertació regula l'agressivitat de cada prebuscador, activant o desactivant selectivament alguns d'ells, en funció del seu rendiment individual i dels requisits d'ample de banda de memòria principal dels altres nuclis. Pel que fa a la interferència en caus compartides, aquesta tesi proposa dues tècniques de particionat per a la LLC, les quals atorguen més espai de memòria cau a les aplicacions que progressen més lentament a causa de la interferència entre aplicacions. La primera proposta per al particionat de memòria cau requereix hardware específic no disponible en processadors comercials, per la qual cosa s'ha avaluat utilitzant un entorn de simulació. La segona proposta de particionat per a memòries cau presenta una família de polítiques que superen les limitacions en el nombre de particions i en el nombre de vies de memòria cau disponibles mitjan¿ cant l'agrupació d'aplicacions en clústers i la superposició de particions de memòria cau, de manera que diverses aplicacions comparteixen les mateixes vies. Atès que s'ha implementat utilitzant els mecanismes per al particionat de la LLC que ofereixen alguns processadors Intel moderns, aquesta proposta s'ha avaluat en una màquina real. Els resultats experimentals mostren que el mecanisme de prebusca selectiva proposat en aquesta tesi redueix el nombre de sol·licituds a la memòria principal en un 20%, cosa que es tradueix en millores en l'equitat del sistema, el rendiment i el consum d'energia. Per altra banda, pel que fa als esquemes de particiónat proposats, en comparació amb un sistema sense particions, ambdues propostes redueixen la iniquitat del sistema en més d'un 25% de mitjana, independentment de la quantitat d'aplicacions en execució, i aquesta reducció en la iniquitat no afecta negativament el rendiment.Selfa Oliver, V. (2018). Adaptive Prefetching and Cache Partitioning for Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112423TESI

    Adaptive prefetch for multicores

    Full text link
    [EN] Current multicore systems implement various hardware prefetchers since prefetching can significantly hide the huge main memory latencies. However, memory bandwidth is a scarce resource which becomes critical with the increasing core count.Therefore, prefetchers must smartly regulate their aggressiveness to make an efficient use of this shared resource. Recent research has proposed to throttle up/down the prefetcher aggressiveness level, considering local and global system information gathered at the memory controller. However, in memory-hungry mixes, keeping active the prefetchers even with the lowest aggressiveness can, in some cases, damage the system performance and increase the energy consumption. This Master's Thesis proposes the ADP prefetcher, which, unlike previous proposals, turns off the prefetcher in specific cores when no local benefits are expected or it is adversely interfering with other cores. The key component of ADP is the activation policy which must foresee when prefetching will be beneficial without the prefetcher being active. The proposed policies are orthogonal to the prefetcher mechanism implemented in the microprocessor. The proposed prefetcher improves both performance and energy with respect to a state-of-the-art adaptive prefetcher in both memory-bandwidth hungry workloads and in workloads combining memory hungry with CPU intensive applications. Compared to a state-of-the-art prefetcher, the proposal almost halves the increase in main memory requests caused by prefetching while improving the performance by 4.46% on average, and with significantly less DRAM energy consumption[ES] Los sistemas multinúcleo actuales implementan diversos mecanismos hardware de prebúsqueda ya que contribuyen a ocultar significativamente las enormes latencias de los accesos a memoria principal. No obstante, el ancho de banda de memoria es un recurso escaso, y se convierte en un importante cuello de botella al incrementarse el número de núcleos. Por lo tanto, los mecanismos de prebúsqueda deben regular inteligentemente su agresividad para hacer un uso eficiente de este recurso compartido. Otro trabajos recientes proponen mecanismos para ajustar el nivel de agresividad del mecanismo de prebúsqueda, teniendo en cuenta tanto información local al núcleo como información global obtenida del controlador de memoria. Sin embargo, en cargas con un gran número de accesos a memoria, mantener activa la prebúsqueda, incluso con los niveles de agresividad más bajos puede, en algunos casos, perjudicar el rendimiento del sistema y aumentar el consumo de energía. Esta Tesina de Master propone ADP, un novedoso mecanismo de prebúsqueda adaptativa que, a diferencia de propuestas anteriores, se desactiva en los núcleos en los que no se espera que mejore las prestaciones o en los que está interfiriendo negativamente con otros núcleos. El componente clave de ADP es la política de activación, ya que debe de ser capaz de prever cuando la prebúsqueda va a aportar beneficios sin que esta esté activa. Además, las políticas propuestas son ortogonales al mecanismo de prebúsqueda implementado en el microprocesador. El mecanismo de prebúsqueda propuesto mejora tanto el rendimiento como el consumo de energía con respecto al estado del arte en prebúsqueda adaptativa, tanto en cargas con un alto consumo de ancho de banda como en cargas que combinan este tipo de cargas con otras que hacen un uso intensivo de la CPU. En comparación, nuestra propuesta reduce prácticamente a la mitad el aumento en los accesos a memoria principal causados por la prebúsqueda. Esta mejora en el rendimiento, de media un 4,46 \%, se consigue con una significativa reducción en el consumo de energía. (español) Current multicore systems implement various hardware prefetchers since prefetching can significantly hide the huge main memory latencies. However, memory bandwidth is a scarce resource which becomes critical with the increasing core count.Therefore, prefetchers must smartly regulate their aggressiveness to make an efficient use of this shared resource.Selfa Oliver, V. (2014). Adaptive prefetch for multicores. http://hdl.handle.net/10251/59527Archivo delegad

    A Research-Oriented Course on Advanced Multicore Architecture

    Full text link
    ©2015 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Multicore processors have become ubiquitous in our real life in devices like smartphones, tablets, etc. In fact, they are present in almost all segments of the computing market, from supercomputers to embedded devices. The huge market competence have lead industry and academia to develop vertiginous technological and architectural advances. The fast evolution that are still experiencing current multicores makes difficult for instructors to offer computer architecture courses with updated contents, preferably showing the industry and academia research trends. To deal with this shortcoming, authors consider that a research-oriented course is the most appropriate solution. This paper presents an advanced computer architecture course called Advanced Multicore Architectures, offered in 2015. The course covers the basic topics of multicore architecture and has been organized in four main modules regarding multicore basis, performance evaluation, advanced caching, and main memory organization. The course follows a research-oriented approach that covers theoretical concepts at lectures in which recent research papers are analyzed to provide students a wide view of current trends. Moreover, additional teaching methods like lab sessions with a state-of-the-art multicore simulator or research-oriented exercises have been used with the aim of introducing students to research in these topics. To achieve this fully research-oriented methodology, about 40% of the time is devoted to labs and exercises.This work was supported by the Spanish Ministerio de Economía y Competitividad (MINECO) and by FEDER funds under Grant TIN2012-38341-C04-01, and by the Intel Early Career Faculty Honor Program Award.Sahuquillo Borrás, J.; Petit Martí, SV.; Selfa Oliver, V.; Gómez Requena, ME. (2015). A Research-Oriented Course on Advanced Multicore Architecture. IEEE Computer Society. https://doi.org/10.1109/IPDPSW.2015.46

    A research-oriented course on Advanced Multicore Architecture: Contents and active learning methodologies

    Full text link
    [EN] The fast evolution of multicore processors makes it difficult for professors to offer computer architecture courses with updated contents. To deal with this shortcoming that could discourage students, the most appropriate solution is a research-oriented course based on current microprocessor industry trends. Additionally, we also seek to improve the students' skills by applying active learning methodologies, where teachers act as guiders and resource providers while students take the responsibility for their learning. In this paper, we present the Advanced Multicore Architecture (AMA) course, which follows a research-oriented approach to introduce students in architectural breakthroughs and uses active learning methodologies to enable students to develop practical research skills such as critical analysis of research papers or communication abilities. To this end five main activities are used: (i) lectures dealing with key theoretical concepts, (ii) paper review & discussion, (iii) research-oriented practical exercises, (iv) lab sessions with a state-of-the-art multicore simulator, and (v) paper presentation. An important part of all these activities is driven by active learning methodologies. Special emphasis is put on the practical side by allocating 40% of the time to labs and exercises. This work also includes an assessment study that analyzes both the course contents and the used methodology (both of them compared to other courses).This work was supported in part by the Spanish Ministerio de Economia y Competitividad (MINECO) and by Plan E funds under Grant TIN2014-62246-EXP and Grant TIN2015-66972-C5-1-R, and by Generalitat Valenciana under grant AICO/2016/059. Authors also would like to thank Onur Mutlu for making available online his valuable teaching material.Petit Martí, SV.; Sahuquillo Borrás, J.; Gómez Requena, ME.; Selfa-Oliver, V. (2017). A research-oriented course on Advanced Multicore Architecture: Contents and active learning methodologies. Journal of Parallel and Distributed Computing. 105:63-72. https://doi.org/10.1016/j.jpdc.2017.01.011S637210

    A Hardware Approach to Fairly Balance the Inter-Thread Interference in Shared Caches

    Full text link
    [EN] Shared caches have become the common design choice in the vast majority of modern multi-core and many-core processors, since cache sharing improves throughput for a given silicon area. Sharing the cache, however, has a downside: the requests from multiple applications compete among them for cache resources, so the execution time of each application increases over isolated execution. The degree in which the performance of each application is affected by the interference becomes unpredictable yielding the system to unfairness situations. This paper proposes Fair-Progress Cache Partitioning (FPCP), a low-overhead hardware-based cache partitioning approach that addresses system fairness. FPCP reduces the interference by allocating to each application a cache partition and adjusting the partition sizes at runtime. To adjust partitions, our approach estimates during multicore execution the time each application would have taken in isolation, which is challenging. The proposed approach has two main differences over existing approaches. First, FPCP distributes cache ways incrementally, which makes the proposal less prone to estimation errors. Second, the proposed algorithm is much less costly than the state-of-the-art ASM-Cache approach. Experimental results show that, compared to ASM-Cache, FPCP reduces unfairness by 48 percent in four-application workloads and by 28 percent in eight-application workloads, without harming the performance.This work was supported in part by the Spanish Ministerio de Economia y Competitividad (MINECO) and Plan E funds, under grants TIN2014-62246-EXP and TIN2015-66972-C5-1-R.Selfa-Oliver, V.; Sahuquillo Borrás, J.; Petit Martí, SV.; Gómez Requena, ME. (2017). A Hardware Approach to Fairly Balance the Inter-Thread Interference in Shared Caches. IEEE Transactions on Parallel and Distributed Systems. 28(11):3021-3032. https://doi.org/10.1109/TPDS.2017.2713778S30213032281

    Prebúsqueda adaptativa para procesadores multinúcleo

    Full text link
    El propósito de este proyecto es diseñar y evaluar mediante simulación un mecanismo dinámico de activación y desactivación de la prebúsqueda de manera independiente para cada uno de los núcleos del microprocesador. Todo ello sobre la base del simulador Multi2Sim [USPL07], sobre el que se ha implementado un mecanismo de prebúsqueda como los utilizados en los procesadores actuales. El mecanismo propuesto usa estadísticas y contadores disponibles en los núcleos junto con otros de muy sencilla implementación con el propósito de inferir si la prebúsqueda está funcionando adecuadamente o no y actuar en consecuencia.Selfa Oliver, V. (2013). Prebúsqueda adaptativa para procesadores multinúcleo. http://hdl.handle.net/10251/32453.Archivo delegad

    Efficient Selective Multicore Prefetching under Limited Memory Bandwidth

    Full text link
    [EN] Current multicore systems implement multiple hardware prefetchers to tolerate long main memory latencies. However, memory bandwidth is a scarce shared resource which becomes critical with the increasing core count. To deal with this fact, recent works have focused on adaptive prefetchers, which control the prefetcher aggressiveness to regulate the main memory bandwidth consumption. Nevertheless, in limited bandwidth machines or under memory-hungry workloads, keeping active the prefetcher can damage the system performance and increase energy consumption. This paper introduces selective prefetching, where individual prefetchers are activated or deactivated to improve both main memory energy and performance, and proposes ADP, a prefetcher that deactivates local prefetchers in some cores when they present low performance and co-runners need additional bandwidth. Based on heuristics, an individual prefetcher is reactivated when performance enhancements are foreseen. Compared to a state-of-the-art adaptive prefetcher, ADP provides both performance and energy enhancements in limited memory bandwidth. (C) 2018 Elsevier Inc. All rights reserved.Selfa-Oliver, V.; Sahuquillo Borrás, J.; Gómez Requena, ME.; Gómez Requena, C. (2018). Efficient Selective Multicore Prefetching under Limited Memory Bandwidth. Journal of Parallel and Distributed Computing. 120:32-43. https://doi.org/10.1016/j.jpdc.2018.05.002S324312

    Phase-Aware Cache Partitioning to Target both Turnaround Time and System Performance

    Full text link
    CPA is LLC (Last Level Cache) partitioning approach that performs an efficient cache space distribution among executing applications. To assign partitions (ways) of the LLC, Intel CAT is used. This policy is included in a framework named manager which is able to launch the experiments.This work has been partially supported by Ministerio de Ciencia, Innovación y Universidades and the European ERDF under Grant RTI2018-098156-B-C51, and Generalitat Valenciana under Grant AICO/2019/317.Pons Escat, L.; Selfa Oliver, V.; Sahuquillo Borrás, J.; Petit Martí, SV.; Pons Terol, J. (2020). Critical Phase-Aware Partitioning Approach (CPA). Universitat Politècnica de València. http://hdl.handle.net/10251/14318

    Make the Most out of Last Level Cache in Intel Processors

    No full text
    In modern (Intel) processors, Last Level Cache (LLC) is divided into multiple slices and an undocumented hashing algorithm (aka Complex Addressing) maps different parts of memory address space among these slices to increase the effective memory bandwidth. After a careful study of Intel’s Complex Addressing, we introduce a slice-aware memory management scheme, wherein frequently used data can be accessed faster via the LLC. Using our proposed scheme, we show that a key-value store can potentially improve its average performance ∼12.2% and ∼11.4% for 100% & 95% GET workloads, respectively. Furthermore, we propose CacheDirector, a network I/O solution which extends Direct Data I/O (DDIO) and places the packet’s header in the slice of the LLC that is closest to the relevant processing core. We implemented CacheDirector as an extension to DPDK and evaluated our proposed solution for latency-critical applications in Network Function Virtualization (NFV) systems. Evaluation results show that CacheDirector makes packet processing faster by reducing tail latencies (90-99th percentiles) by up to 119 µs (∼21.5%) for optimized NFV service chains that are running at 100 Gbps. Finally, we analyze the effectiveness of slice-aware memory management to realize cache isolationQC 20190226Time-Critical CloudsULTRAWAS
    corecore